Bug 509 #2333
base: main
Conversation
For more information, see https://pre-commit.ci
Codecov Report: Patch and project coverage have no change.
Additional details and impacted files:
@@           Coverage Diff           @@
##              dev   #2333    +/-  ##
=======================================
  Coverage     88.4%   88.4%
=======================================
  Files           87      87
  Lines        10139   10139
=======================================
  Hits          8971    8971
  Misses        1168    1168
☔ View full report in Codecov by Sentry.
src/pudl/transform/eia.py
Outdated
eia_transformed_dfs["generation_fuel_eia923"] = eia_transformed_dfs[ | ||
"generation_fuel_eia923" | ||
].drop(columns=["utility_name_eia"]) | ||
eia_transformed_dfs["generation_fuel_nuclear_eia923"] = eia_transformed_dfs[ | ||
"generation_fuel_nuclear_eia923" | ||
].drop(columns=["utility_name_eia"]) | ||
|
I'm a little surprised that this is necessary. I thought that by default the harvesting process dropped any column that was harvested from the table it was found in (unless we specifically tell it not to rip the column out). But I could be wrong about this; @cmgosnell would probably know better.
One option for dealing generically with leftover columns that need to be harvested but shouldn't stick around is to use the pudl.metadata.classes.Resource.format_df() method, as we do in pudl.transform.classes.AbstractTableTransformer.enforce_schema().
In fact, copying enforce_schema() almost verbatim into a new function might be the way to go. I think the only big difference is that you'd need to look up the appropriate resource using the table name that is the key in the dictionary of dataframes, rather than having self.table_id to work with.
Eventually the plan is to refactor the EIA transforms to use the same class design that we've applied to FERC 1, but for now just iterating over all the finished dataframes at the very end of the EIA transformation process would avoid the need to special-case which columns should or shouldn't be present, and would also take care of some other basic checks for things that would otherwise make the DB complain (no null PKs, unique PKs, etc.).
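As a rough sketch of that idea (not code from this PR; it assumes every key in the dictionary of dataframes is a valid PUDL resource ID, and that Resource.format_df() returns the dataframe coerced to the resource's schema, dropping any extra columns):

import pudl.metadata.classes

def enforce_all_schemas(transformed_dfs):
    """Coerce each finished dataframe to its metadata-defined schema.

    Sketch only: looks up the resource by the table name keying the dict,
    instead of using self.table_id the way enforce_schema() does.
    """
    package = pudl.metadata.classes.Package.from_resource_ids()
    for table_name, df in transformed_dfs.items():
        resource = package.get_resource(table_name)
        transformed_dfs[table_name] = resource.format_df(df)
    return transformed_dfs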
…source, so it can be used from elsewhere
For more information, see https://pre-commit.ci
…Frames before converting to SQLite
For more information, see https://pre-commit.ci
src/pudl/transform/eia.py
Outdated
    for cat in eia_transformed_dfs:
        resource = (
            pudl.metadata.classes.Package.from_resource_ids().get_resource(cat)
        )
        eia_transformed_dfs[cat] = resource.enforce_schema(eia_transformed_dfs[cat])
Is there any reason not to wait to do this until the very end (after the balancing authority fix below)?
Is there any reason not to call enforce_schema() on the entity tables too?
Yes, I had already moved the code to the very end of the function.
And I added code to do the same thing for the entity tables. That doesn't seem to cause any distress.
Wow that's great! I'm slightly surprised. Do you want to try running tox -e nuke overnight and see if it stays happy?
Will do
Seems it was not successful:
nuke: exit -9 (641.56 seconds) /home/knordback/Kurt/pudl> bash -c 'coverage run --append src/pudl/cli.py --logfile tox-nuke.log --clobber src/pudl/package_data/settings/etl_full.yml' pid=793582
.pkg: _exit> python /home/knordback/.conda/envs/pudl-dev/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
nuke: FAIL code -9 (7080.40=setup[98.70]+cmd[0.12,0.54,0.48,0.39,0.41,0.42,0.39,0.53,0.24,6.33,6.21,1.24,0.03,1.72,1.23,161.77,33.89,25.29,3980.21,0.01,2118.71,641.56] seconds)
evaluation failed :( (7081.04 seconds)
All of those tests run successfully. So does tox with no arguments (which I thought was the commit criterion). So it seems to be something specific to the "nuke".
I am indeed getting copious output, but it's not (to my eye) indicating what the problem is. As mentioned above, I get some warnings but nothing obviously connected to this.
I'll be away next week. I will look into this more when I get back.
Hi Zane. I'm back, for a little while, and have some time to work on this again. I can continue un-dropping fields, but I'm not sure how worried to be about not being able to run tox -e nuke successfully. As noted, the other tests run fine for me. Output is attached in case you want to take a look.
tox-out.txt
Have you tried running the full ETL with the pudl_etl command line tool? The place where it's failing in your logs is the most memory-intensive part of the ETL, so my guess is that it's running out of memory and crashing. I thought you'd been able to do the full ETL previously? How much memory + swap do you have? I think it may take up to 25GB as it is now.
tox with no arguments is what gets run in CI on GitHub. It only processes 1 year of data. For changes like the one you're making, which will affect all years of data, we try to run the full ETL locally before it gets merged into dev.
tox -e nuke is a blunt instrument that not only runs the full ETL, but also all the unit + integration tests on all the data + the data validation tests. You probably want to try running the full ETL, but only for the EIA data, to check whether everything is working. The devtools/eia-etl-debug.ipynb notebook is the easiest / most efficient way to do that right now, since you don't have to run the extract step over and over again. It'll also avoid this (FERC 1 related) memory-intensive step.
I have 16 GB RAM + 15 GB swap. I thought I could run the full ETL, but apparently not that either. I was able to run devtools/eia-etl-debug.ipynb (first time using Jupyter -- it's kinda slick).
Jupyter is great for working with data. Lots more human-friendly tools for seeing what's going on in there! Definitely worth getting familiar with.
Other than removing the null df.drop() I think this looks great!
Do you want to go ahead and merge dev in and try to run the full ETL + data validations to see if anything is broken?
src/pudl/transform/eia923.py
Outdated
"net_generation_mwh_year_to_date", | ||
"early_release", | ||
], | ||
[], |
We should just remove this .drop() altogether.
If you've got the full DB already populated locally, you can run the full data tests and validations in parallel in two different windows with these commands from the main repo directory:
pytest --live-dbs test/validate
pytest --live-dbs --etl-settings src/pudl/package_data/settings/etl_full.yml test/integration
With the current code I get this:
Does this make sense? Is there an easy way to look at the diffs?
Most of these changes are small increases in the number of rows, which seems pretty reasonable. Changes that stand out as maybe funny:
@cmgosnell do you have any thoughts on why these tables would have such big changes just because we're not dropping "extra" columns before harvesting? The utilities make some sense, but the FRC and generation seem surprising.
To diff the tables, there's sqldiff but it looks like that may only be a Windows utility. I think @rousik or @zschira were doing some DB table diffing, maybe with another utility?
I would probably read the two tables into pandas, set the index to the primary key, and see what records exist in one table but not the other by taking the difference between the two indexes. But that won't work well on the FRC table since it has no natural primary key.
@knordback it looks like @rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.
Okay, that would be good. I get a lot of output just running sqldiff, and I don't know how to extract the essential difference.
Do you have the outputs on gcs, or can you tell me how to repro the etl run in question? I could test my scripts against this case.
I want to compare the pudl.sqlite produced on dev against that produced in the bug-509 branch. (I just merged from dev to the branch, so the diffs should only correspond to the branch changes.) I also have both locally and could put them on Google. Is there a particular place this kind of thing should go?
All right, I have run the bug-509 branch vs dev and here's the output diff report:
The positive number means that that many rows were added in the bug-509 branch (right side of the comparison). The relevant piece of code you can use to fetch the differing rows would be something along the lines of:

import pandas as pd
from sqlalchemy import create_engine

def read_table_as_df(db_path, table_name):
    # Read the table in chunks and concatenate, to keep memory use bounded.
    con = create_engine(db_path)
    return pd.concat([
        df for df in pd.read_sql_table(table_name, con, chunksize=100_000)
    ])

df_left = read_table_as_df(left_db, table)
df_right = read_table_as_df(right_db, table)
df_merge = df_left.merge(df_right, how="outer", indicator=True)
# Then, df_merge["_merge"] has values "left_only", "right_only", or "both",
# depending on where each row occurs.
Ah, this is super helpful. More rows in bug-509 is what we would expect. Which version of the ETL did you run to get this?
Latest dev available at the time and the latest commit of bug-509. The rousik-output-diff branch has a tool, output_diff, that you can point at two directories and it will produce the report for you. I'll be working on automating this.
Huh, those differences in the number of rows seem pretty different from what @knordback got when he ran the
It depends on the ETL version used to create the pudl.sqlite, right? So I'm running modified and reference versions of etl_full_no_cems and will try @rousik's tool on the output.
If you've got two different DBs side by side and you want to see differences between the outputs they generate, you can do something like the following in a Jupyter Notebook (here looking at the plant_in_service_ferc1 output):
import pandas as pd
import sqlalchemy as sa
from pudl.metadata.classes import Resource
from pudl.output.pudltabl import PudlTabl
pk_cols = Resource.from_id("plant_in_service_ferc1").schema.primary_key
# Use the real paths to your 2 DBs obvs:
left_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
left_engine = sa.create_engine(left_db_url)
left_pudl_out = PudlTabl(left_engine)
left_df = left_pudl_out.plant_in_service_ferc1().set_index(pk_cols)
# Use the real paths to your 2 DBs obvs:
right_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
right_engine = sa.create_engine(right_db_url)
right_pudl_out = PudlTabl(right_engine)
right_df = right_pudl_out.plant_in_service_ferc1().set_index(pk_cols)
left_only_index = left_df.index.difference(right_df.index)
right_only_index = right_df.index.difference(left_df.index)
left_only_df = left_df.loc[left_only_index]
right_only_df = right_df.loc[right_only_index]
Remarkably, I finally got back to this. I ran the bug-509 code and compared output against dev. To compare, I first ran Jan's output_diff tool to find pudl.sqlite tables that differ. It produced the following:
(It also found differences in ferc2.sqlite, but I think those occur because resource constraints on my machine meant that not everything could run.) Anyhow, in the output above I chose table differences that I felt pretty confident I could identify in.
So this is all worrisome and it appears maybe something is going wrong. It would be good to get a quick check of this to see if maybe this is better than it appears to me, or if I need to investigate more deeply.
I'm not sure this is necessarily concerning. You're comparing the
I think it makes sense to start with one of the "base" tables rather than derived tables that are several steps downstream, since the tables that are the immediate outputs from the harvesting are going to be the first places where the consequences of the new process appear, and everything downstream will be a result of those changes. The outputs of the harvesting process are the entity and annual tables for plants, generators, boilers, and utilities:
You might also want to pull the data directly from the database(s) rather than using PudlTabl.
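For example, a rough sketch of pulling one of the harvested tables straight out of both databases and diffing on its primary key (the table choice and DB paths here are placeholders, not taken from this PR):

import pandas as pd
import sqlalchemy as sa

from pudl.metadata.classes import Resource

# Placeholder table and paths -- substitute one of the harvested tables
# (e.g. plants_entity_eia) and the real locations of your two DBs.
table = "plants_entity_eia"
pk_cols = Resource.from_id(table).schema.primary_key

dev_engine = sa.create_engine("sqlite:///path/to/dev/pudl.sqlite")
branch_engine = sa.create_engine("sqlite:///path/to/bug-509/pudl.sqlite")

dev_df = pd.read_sql_table(table, dev_engine).set_index(pk_cols)
branch_df = pd.read_sql_table(table, branch_engine).set_index(pk_cols)

# Records that only show up on one side of the comparison:
dev_only = dev_df.loc[dev_df.index.difference(branch_df.index)]
branch_only = branch_df.loc[branch_df.index.difference(dev_df.index)]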
Here's a selection of the differences I'm seeing. This is not comprehensive, as some of the tables have many differences and I haven't looked at all of them.
The following files show the differences. (I should have named them better: *-left.txt come from dev; *-right.txt come from the bug-509 branch.) These are generated with code that loads the created
This is a proposed way of dealing with fields that are in the input data, that we're no longer dropping in the eia923 transform() functions, but that aren't in the corresponding SQLite tables. Basically, I'm just stripping out the columns after the harvesting step. In ToT code, the fields are dropped so they never appear in the DataFrames. If we don't drop them, then leaving the corresponding columns in results in a crash when trying to populate the SQLite DB.
In this code, the relevant field I'm dealing with is utility_name_eia, which comes in as operator_name. This is in the generation_fuel_eia923 table, but gets duplicated in generation_fuel_nuclear_eia923. So it needs to be removed from both.
This approach feels very special-case-y, but maybe that's the nature of the beast. If the general approach seems okay and other currently-dropped fields need to be handled in the same way, then I'd probably at least pull the new code out into a separate function.